
    Comparing a statistical and a rule-based tagger for German

    In this paper we present the results of comparing a statistical tagger for German based on decision trees with a rule-based Brill tagger for German. We used the same training corpus (and therefore the same tag set) to train both taggers. We then applied the taggers to the same test corpus and compared their behavior, in particular their error rates. Both taggers perform similarly, with an error rate of around 5%. The detailed error analysis shows that the rule-based tagger has more problems with unknown words than the statistical tagger, while the opposite holds for tokens that are many-ways ambiguous. If the unknown words are fed into the taggers with the help of an external lexicon (such as the Gertwol system), the error rate of the rule-based tagger drops to 4.7% and that of the statistical tagger drops to around 3.7%. Combining the taggers by using the output of one tagger to help the other did not lead to any further improvement. Comment: 8 pages

    Detecting Protein-Protein Interactions in Biomedical Literature Using a Parser

    We describe the task of automatically detecting interactions between proteins in biomedical literature. We use a syntactic parser, a corpus annotated for proteins, and manual decisions as training material. After automatically parsing the GENIA corpus, which is manually annotated for proteins, all syntactic paths between proteins are extracted. These syntactic paths are manually classified as either meaningful or irrelevant. Meaningful paths express an interaction between the syntactically connected proteins; irrelevant paths do not convey any interaction. The resource created by these manual decisions is used in two ways. First, words that appear frequently inside a meaningful path are learnt using simple machine learning. Second, these resources are applied to the task of automatically detecting interactions between proteins in biomedical literature.

    Learning to Disambiguate Syntactic Relations

    Natural language is highly ambiguous on every level. This article describes a fast, broad-coverage, state-of-the-art parser that combines a carefully hand-written grammar with probability-based machine learning approaches on the syntactic level. It shows in detail which statistical learning models based on Maximum-Likelihood Estimation (MLE) can support a highly developed linguistic grammar in the disambiguation process.
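
    As a rough illustration of what Maximum-Likelihood Estimation means for syntactic disambiguation, the sketch below estimates the probability of a relation given a head and a dependent as a relative frequency of observed counts. The function name, the triple format and the toy observations are assumptions made for this example; they are not taken from the article, whose actual models are not described in the abstract.

```python
from collections import Counter


def mle_relation_probs(observations):
    """Estimate P(relation | head, dependent) by relative frequency (MLE).

    `observations` is an iterable of (head, dependent, relation) triples.
    This triple format is a hypothetical choice for illustration only.
    """
    joint = Counter()    # counts of (head, dependent, relation)
    context = Counter()  # counts of (head, dependent)
    for head, dependent, relation in observations:
        joint[(head, dependent, relation)] += 1
        context[(head, dependent)] += 1
    # MLE: joint count divided by the count of the conditioning context
    return {
        key: count / context[(key[0], key[1])]
        for key, count in joint.items()
    }


# Toy, made-up observations (not real corpus data)
toy = [
    ("eat", "pizza", "obj"),
    ("eat", "pizza", "obj"),
    ("eat", "pizza", "subj"),
    ("eat", "fork", "pobj"),
]
probs = mle_relation_probs(toy)
```

    A disambiguator built on such estimates would prefer the relation with the highest estimated probability for a given head and dependent; real systems additionally need smoothing or back-off for unseen pairs.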

    Size dependence of the dielectric breakdown strength from nano- to millimeter scale

    Dielectric breakdown decisively determines the reliability of nano- to centimeter-sized electronic devices and components. A systematic investigation of the size-dependent dielectric breakdown strength reveals a thickness-independent intrinsic regime and a thickness-dependent extrinsic regime. In addition, the breakdown strength scales with the inverse square root of the permittivity. Only recently could the intrinsic breakdown strength be explained theoretically by density functional theory calculations, which confirmed von Hippel’s electron avalanche model. The thickness dependence resembles the difference between an intrinsic mechanical strength and a volume-dependent, defect-size-controlled Weibull distribution of mechanical strength. We therefore discuss the hypothesis that the thickness dependence of dielectric breakdown can be explained by a weakest-link concept. Finally, it is shown that the prevailing electrical conduction mechanism at the onset of dielectric breakdown is most probably dominated by space charge injection. A Griffith-type energy release rate breakdown model including space charge conductivity is presented, which explains the empirical results in the extrinsic regime.
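
    The permittivity scaling stated above, and the weakest-link idea it is compared with, can be written compactly as follows. The first relation restates the abstract; the second is the generic textbook form of a two-parameter Weibull weakest-link distribution, given here only as an assumed illustration and not as the model used in the paper. The symbols $E_{\mathrm{bd}}$ (breakdown strength), $\varepsilon_r$ (relative permittivity), $V$ (stressed volume), $V_0$, $E_0$ (reference volume and scale field) and $m$ (Weibull modulus) are notation introduced for this sketch.

```latex
% Scaling stated in the abstract
E_{\mathrm{bd}} \propto \varepsilon_r^{-1/2}

% Generic weakest-link (Weibull) failure probability, assumed textbook form
P_f(E, V) = 1 - \exp\!\left[ -\frac{V}{V_0} \left( \frac{E}{E_0} \right)^{m} \right]
```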

    Text-Mining-Methoden im Semantic Web (Text Mining Methods in the Semantic Web)

    Summary: Building, maintaining and using large knowledge bases requires the combined use of human and machine information processing. Since large parts of human knowledge exist in text form, text mining methods lend themselves to extracting knowledge content. This article covers the fundamentals of text mining in the context of the Semantic Web. It discusses text mining methods used for the semi-automatic annotation of texts and text segments, in particular named-entity recognition, automatic keyword recognition, automatic document classification, semi-automatic ontology construction, and semi-automatic fact recognition (fact recognition, event recognition). Critical background questions are also addressed, and the problem of the excessively high error rate and insufficient performance of automatic methods is discussed. Two practical examples are presented: first, the OntoGene research project at the University of Zurich, in which protein-protein interactions are extracted from the scientific literature as relation triples, and second, an ontology-based tag recommender that supports the manual assignment of keywords to knowledge resources.

    Comparing the coverage of the “marriage for all” vote on Twitter and in the newspapers

    This paper investigates the differences between Twitter and newspaper coverage of the “marriage for all” popular vote that took place in Switzerland in 2021. More precisely, we ask the following questions: How salient were discussions about marriage for all on Twitter and in the newspapers? What major arguments were mobilized in both media? How were these arguments received (i.e. retweets, likes, replies)? We extracted publicly available tweets from users involved in the debate and news articles containing specific keywords. These text data were automatically analyzed to find major views and topics of discussion using keyword and collocation analyses, as well as topic modelling. Results show that criticism of marriage for all is clearly in the minority, but there is strong polarization over whether same-sex couples should be allowed to adopt or have children through sperm donation.

    Comparison of Canopy Openness in Different Cocoa (Theobroma cacao) Production Systems in Alto Beni, Bolivia

    Cocoa (Theobroma cacao L.) grows naturally as an understory tree in tropical forests and produces well under both shaded and non-shaded conditions. It is cultivated by small-scale farmers in South America under various conditions, ranging from monocultures to different kinds of agroforestry systems. While in monocultures it is exposed to direct sunlight, one or several tree species shade the cocoa in agroforestry systems. Organic cocoa cultivation is also becoming increasingly popular due to premium prices and growing ecological awareness. In Alto Beni, Bolivia, the Research Institute of Organic Agriculture (FiBL) and local partners have established a long-term field trial to compare cocoa production systems. The bi-factorial randomised block design combines management and biodiversity factors into the following five cocoa treatments: monoculture and agroforestry systems, each under organic and conventional management, and a successional agroforestry system (high plant species diversity) under organic management. For further comparison, fallow plots of the same age as the cocoa plots are included. Research is carried out in all fields of agronomic, economic and environmental interest. This study focuses on comparing the canopy openness of the different cocoa production systems and fallow plots. Knowledge of canopy openness makes it possible to estimate the light entering the production system, especially at the cocoa layer (relevant for photosynthesis) and at the soil, since canopy openness influences the microclimate in the plantation. Another aspect of the canopy is its impact on throughfall within the plot. Over time, variations in canopy structure indicate biomass production and nutrient enrichment by throughfall (rain-wash and nutrient leaching from leaves in the canopy), and may indicate the need for pruning when the plant cover above the cocoa exceeds critical values.
    To estimate canopy openness, hemispherical photographs were taken with fisheye lenses in 2012 and 2013 in the different cocoa production systems and in the fallow plots. The photos were analysed with the programme Gap Light Analyser. First results on canopy openness across the cocoa systems will be shown and discussed with respect to leaf area index and potential microclimate differences.

    Scaling Native Language Identification with Transformer Adapters

    Native language identification (NLI) is the task of automatically identifying the native language (L1) of an individual based on their language production in a learned language. It is useful for a variety of purposes, including marketing, security and educational applications. NLI is usually framed as a multi-label classification task, where numerous designed features are combined to achieve state-of-the-art results. Recently, a deep generative approach based on transformer decoders (GPT-2) outperformed its counterparts and achieved the best results on the NLI benchmark datasets. We investigate this approach to determine its practical implications compared to traditional state-of-the-art NLI systems. We introduce transformer adapters to address memory limitations and improve training and inference speed in order to scale NLI applications for production.

    Assessing How Attitudes to Migration in Social Media Complement Public Attitudes Found in Opinion Surveys

    This article compares migration discourses in traditional opinion surveys and social media in a cross-country perspective across five English-speaking countries. Despite the extensive survey research on migration, social media discussions on migration remain understudied, and little is known about their potential complementarity to survey findings. On the basis of automated content analysis, we present insights into the salience of and sentiment about migration by comparing both data sources. We also investigate which societal factors and framings of migration influence the salience of social media discussions. We find that, overall, there is a good correlation between salience of and sentiment toward migration, both in surveys and on social media. We also demonstrate that societal factors significantly impact the salience of migration online. The observed dynamics may nevertheless differ depending on the sample of users, thus demonstrating the different incentives that motivate users to engage with the migration topic online. Methodologically, our contribution also demonstrates the need to reflect on the impact of different data collection strategies on the obtained findings.

    Hypothesis Engineering for Zero-Shot Hate Speech Detection

    Standard approaches to hate speech detection rely on the availability of sufficient hate speech annotations. Extending previous work that repurposes natural language inference (NLI) models for zero-shot text classification, we propose a simple approach that combines multiple hypotheses to improve English NLI-based zero-shot hate speech detection. We first conduct an error analysis for vanilla NLI-based zero-shot hate speech detection and then develop four strategies based on this analysis. The strategies use multiple hypotheses to predict various aspects of an input text and combine these predictions into a final verdict. We find that the zero-shot baseline used for the initial error analysis already outperforms commercial systems and fine-tuned BERT-based hate speech detection models on HateCheck. The combination of the proposed strategies further increases the zero-shot accuracy of 79.4% on HateCheck by 7.9 percentage points (pp), and the accuracy of 69.6% on ETHOS by 10.0 pp.